Load libraries
library(rstan)
rstan_options(auto_write = TRUE)
library(bayesplot)
library(gridExtra)
source('monitornew.R')
This notebook has experiments for rank-normalized split-Rhat and r_eff, where r_eff is the relative efficiency of an MCMC sample.
In the code and figures I use abbreviations: sRhat (split-Rhat), zsRhat (rank-normalized split-Rhat), fzsRhat (folded rank-normalized split-Rhat), reff (relative efficiency), and zsreff (rank-normalized relative efficiency).
The proposal is to switch from Rhat and neff to the rank-normalized split versions.
Rank normalization makes zsRhat and zsreff well defined and more stable than sRhat and neff. Adding the folded version fzsRhat helps to detect differences in scale.
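As a rough sketch of the rank-normalization step (the actual implementation is in monitornew.R; the exact rank-to-quantile offset used here is an assumption):

```r
# Rank-normalize pooled draws: replace each value by the normal quantile of
# its rank. Average ranks for ties keep constant chains constant.
z_scale <- function(x) {
  r <- rank(as.vector(x), ties.method = "average")
  z <- qnorm((r - 3/8) / (length(r) + 1/4))  # Blom-type offset (assumed)
  array(z, dim = dim(x), dimnames = dimnames(x))
}

set.seed(1)
x <- matrix(rexp(1000), 250, 4)  # draws with a heavy right tail
z <- z_scale(x)                  # same ranks, but Gaussian marginals
```

After this transformation the draws have finite mean and variance regardless of the original distribution, so the usual Rhat and autocorrelation machinery is well defined.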
I also propose the following warning thresholds: 1.01 for the Rhat values and 0.1 for the relative efficiencies.
I have plotted dashed lines based on these thresholds (in continuous plots, a dashed line at 1.005 is plotted instead of 1.01, as values larger than that are usually rounded to 1.01 in our summaries).
In addition a local efficiency plot is introduced.
This part focuses on diagnostics for detecting trends and differences between chains.
To focus on these, independent draws are used as a proxy for very efficient MCMC. In addition, the draws are from a normal distribution; we discuss the behavior for non-Gaussian distributions later.
res <- data.frame(iters=integer(),trend=double(),rep=integer(),sRhat=double(),zsRhat=double(),fzsRhat=double(),f2zsRhat=double(),neff=integer(),zsneff=integer())
chains=4
for (iters in c(250, 1000, 4000)) {
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
for (irep in 1:10) {
r <- array(rnorm(iters*chains), c(iters,chains))
r <- r + seq(-a, a, length.out=iters)
rs <- monitornew(r, warmup=0, digits_summary = 2, probs=c(0.5), print = FALSE)
res<-rbind(res,data.frame(iters=iters,trend=a,rep=irep,Rhat=rs[,'Rhat'],sRhat=rs[,'sRhat'],zsRhat=rs[,'zsRhat'],fzsRhat=rs[,'fzsRhat'],f2zsRhat=rs[,'f2zsRhat'],neff=rs[,'neff'],zsneff=rs[,'zsneff'],reff=rs[,'reff'],zsreff=rs[,'zsreff']))
}
}
}
If we don’t split chains, Rhat misses trends if all chains still have similar marginal distributions.
ggplot(data=res, aes(y=Rhat, x=trend)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ggtitle('Rhat without splitting chains')
Split-Rhat can detect trends, even if the marginals of chains are similar.
ggplot(data=res, aes(y=zsRhat, x=trend)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ggtitle('Split-Rhat')
Result: Split-Rhat is useful for detecting non-stationarity in chains. With threshold 1.01 we can detect trends whose variance is 2% of the marginal variance; with threshold 1.1, trends whose variance is 30% of the marginal variance.
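Splitting simply cuts each chain in half and treats the halves as separate chains, so a within-chain trend becomes a between-chain difference; a minimal sketch (the real version is in monitornew.R):

```r
# Split each chain in half so that a trend inside a chain shows up as a
# difference between the two half-chains.
split_chains <- function(x) {   # x: iterations x chains
  n <- floor(nrow(x) / 2)
  cbind(x[1:n, , drop = FALSE],
        x[(nrow(x) - n + 1):nrow(x), , drop = FALSE])
}

set.seed(1)
r <- matrix(rnorm(1000 * 4), 1000, 4)
r <- r + seq(-1, 1, length.out = 1000)   # same trend in every chain
rsplit <- split_chains(r)                # 500 iterations, 8 half-chains
```

Each first half now has a lower mean than the corresponding second half, which the between-chain variance term in Rhat picks up even though the full chains have identical marginals.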
Relative efficiency (effective sample size divided by the number of draws) is based on split Rhat and autocorrelations.
ggplot(data=res, aes(y=zsreff, x=trend)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ggtitle('Relative efficiency (zsreff)') + scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: Split-Rhat is more sensitive to trends with small sample sizes, but relative efficiency is more sensitive with larger sample sizes (as autocorrelations can be estimated more accurately).
Advice: If in doubt, run longer chains for more accurate estimates.
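The underlying idea is the usual autocorrelation-time formula; a crude single-chain sketch (the estimator in monitornew.R combines chains and truncates the autocorrelation sum more carefully):

```r
# Rough single-chain relative efficiency: 1 / (1 + 2 * sum of positive
# autocorrelations), with a naive cutoff at the first non-positive one.
reff1 <- function(x, max.lag = 100) {
  rho <- acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]
  k <- which(rho <= 0)[1]
  if (!is.na(k)) rho <- rho[seq_len(k - 1)]
  1 / (1 + 2 * sum(rho))
}

set.seed(2)
x_ind <- rnorm(4000)                                  # independent draws
x_ar  <- as.vector(arima.sim(list(ar = 0.9), 4000))   # slowly mixing chain
```

For the independent draws `reff1` is close to 1, while for the sticky AR(1) chain it is close to the theoretical value (1 - 0.9)/(1 + 0.9) ≈ 0.05.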
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.
res <- data.frame(iters=integer(),trend=double(),rep=integer(),sRhat=double(),zsRhat=double(),fzsRhat=double(),f2zsRhat=double(),neff=integer(),zsneff=integer())
chains=4
for (iters in c(250, 1000, 4000)) {
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
for (irep in 1:10) {
r <- array(rnorm(iters*chains), c(iters,chains))
r[,1] <- r[,1] + a
rs <- monitornew(r, warmup=0, digits_summary = 2, probs=c(0.5), print = FALSE)
res<-rbind(res,data.frame(iters=iters,shift=a,rep=irep,Rhat=rs[,'Rhat'],sRhat=rs[,'sRhat'],zsRhat=rs[,'zsRhat'],fzsRhat=rs[,'fzsRhat'],f2zsRhat=rs[,'f2zsRhat'],neff=rs[,'neff'],zsneff=rs[,'zsneff'],reff=rs[,'reff'],zsreff=rs[,'zsreff']))
}
}
}
ggplot(data=res, aes(y=zsRhat, x=shift)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ggtitle('Split-Rhat')
Result: With threshold 1.01 we can detect a shift of magnitude one third of the marginal standard deviation; with threshold 1.1, a shift of magnitude equal to the marginal standard deviation. The rank plot can be used to visualize where the problem is.
ggplot(data=res, aes(y=zsreff, x=shift)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ggtitle('Relative efficiency (zsreff)') + scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: The relative efficiency is not as sensitive, but a shift with magnitude half of the marginal standard deviation will lead to very low efficiency when sample size increases.
Rank plot visualisation for a case with 4 chains and 250 draws per chain, and shift = 0.5.
iters=250;chains=4;a=0.5;
r <- array(rnorm(iters*chains), c(iters,chains))
r[,1] <- r[,1] + a
colnames(r)<-c('1','2','3','4');
mcmc_hist(r_scale(r), breaks = seq(0,iters*chains,by=iters*chains/20)+0.5)
Rhat is less than 1.05, but the rank plot clearly shows where the problem is.
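The r_scale function comes from monitornew.R; the idea, sketched here as an assumption about its behavior, is just to replace every draw by its rank over all chains pooled:

```r
# Replace pooled draws by their ranks; for well-mixed chains each chain's
# rank histogram is approximately uniform, so a shifted or mis-scaled chain
# stands out visually.
r_scale <- function(x) {
  array(rank(as.vector(x), ties.method = "average"),
        dim = dim(x), dimnames = dimnames(x))
}

set.seed(3)
r <- matrix(rnorm(250 * 4), 250, 4)
rr <- r_scale(r)   # every integer 1..1000 appears exactly once
```

With the uniform breaks used in the mcmc_hist calls above, deviations from a flat histogram in any chain indicate mixing problems.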
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others but has lower marginal variance.
res <- data.frame(iters=integer(),trend=double(),rep=integer(),sRhat=double(),zsRhat=double(),fzsRhat=double(),f2zsRhat=double(),neff=integer(),zsneff=integer())
chains=4
for (iters in c(250, 1000, 4000)) {
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
for (irep in 1:10) {
r <- array(rnorm(iters*chains), c(iters,chains))
r[,1] <- r[,1]*a
rs <- monitornew(r, warmup=0, digits_summary = 2, probs=c(0.5), print = FALSE)
res<-rbind(res,data.frame(iters=iters,scale=a,rep=irep,Rhat=rs[,'Rhat'],sRhat=rs[,'sRhat'],zsRhat=rs[,'zsRhat'],fzsRhat=rs[,'fzsRhat'],f2zsRhat=rs[,'f2zsRhat'],neff=rs[,'neff'],zsneff=rs[,'zsneff'],reff=rs[,'reff'],zsreff=rs[,'zsreff']))
}
}
}
ggplot(data=res, aes(y=zsRhat, x=scale)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ggtitle('Split-Rhat')
Result: Split-Rhat is not able to detect differences in scale.
ggplot(data=res, aes(y=fzsRhat, x=scale)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ggtitle('Folded-split-Rhat')
Result: Folded-split-Rhat focuses instead on scale, and can detect scale differences.
Result: If we use threshold 1.01, we can detect a chain with scale less than 3/4 of the scale of the others. If we use threshold 1.1, we can detect a chain with scale less than 1/4 of the scale of the others.
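The folding step can be sketched as follows (assumed to match monitornew.R in spirit): draws are folded around the pooled median, so a difference in scale becomes a difference in location that split-Rhat can see.

```r
# Fold draws around the pooled median: a chain with too small a scale
# becomes a chain with too small a location after folding.
fold <- function(x) abs(x - median(x))

set.seed(4)
r <- matrix(rnorm(1000 * 4), 1000, 4)
r[, 1] <- r[, 1] * 0.25    # one chain with a much smaller scale
f <- fold(r)
colMeans(f)                 # chain 1 is clearly below the others
```

Applying the rank-normalized split-Rhat to `f` then flags the scale difference that the plain location-based version misses.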
ggplot(data=res, aes(y=zsreff, x=scale)) + geom_point() + geom_jitter() + facet_grid(. ~ iters) + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ggtitle('Relative efficiency (zsreff)') + scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: The relative efficiency does not see a problem as it focuses on location.
Rank plot visualisation for a case with 4 chains and 250 draws per chain, where one chain has scale 0.75.
iters=250;chains=4;a=0.75;
r <- array(rnorm(iters*chains), c(iters,chains))
r[,1] <- r[,1]*a
colnames(r)<-c('1','2','3','4');
mcmc_hist(r_scale(r), breaks = seq(0,iters*chains,by=iters*chains/20)+0.5)
Folded Rhat is 1.06, but the rank plot clearly shows where the problem is.
Classic Rhat is based on comparing within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not defined, classic Rhat is not well justified. Next we use the Cauchy distribution as an example of such a distribution. Even in cases where the mean and variance are finite, the distribution can be far from Gaussian, and distributions with very long tails in particular cause instability in variance and autocorrelation estimates. To alleviate these problems we compute split-Rhat for rank-normalized draws. Ranks are computed using averages for ties, as this is deterministic and maps constant chains to constant chains.
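For reference, a compact sketch of the classic Rhat computation from within- and between-chain variances (assuming x already holds the split chains as columns):

```r
# Classic Rhat from within-chain variance W and between-chain variance B
# of the chain means.
rhat <- function(x) {   # x: iterations x (already split) chains
  n <- nrow(x)
  W <- mean(apply(x, 2, var))       # average within-chain variance
  B <- n * var(colMeans(x))         # between-chain variance of the means
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(5)
r <- matrix(rnorm(1000 * 8), 1000, 8)
rhat(r)   # close to 1 for independent, well-mixed chains
```

For rank-normalized draws the same formula is applied to z_scale-transformed values, which keeps W and B finite even for Cauchy-distributed quantities.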
The following Cauchy models are originally from Michael Betancourt’s case study Fitting The Cauchy Distribution
The nominal model with direct parameterization is as follows.
writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}
model {
  x ~ cauchy(0, 1);
}
generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model
fit_nom <- stan(file='cauchy_nom.stan', seed=7878, refresh = 0)
Warning: There were 1749 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
The diagnostics warn about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tail of the Cauchy. Two chains have a low estimated Bayesian Fraction of Missing Information, which also indicates slow mixing.
MCMC trace for the first parameter looks wild with occasional large values
samp<-as.array(fit_nom)
mcmc_trace(samp[,,1])
Let’s check Rhat and relative efficiency diagnostics.
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic Split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(c(.99 ,1.26))
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank normalized split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(c(.99,1.26))
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Folded rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(c(.99,1.26))
grid.arrange(p1, p2, p3, nrow = 1)
For one parameter the Rhats exceed the classic threshold 1.1, and depending on the Rhat variant a few others exceed the threshold 1.01. The interpretation of classic Rhat is, however, not well defined here. The rank-normalized split-Rhat has several values over 1.01.
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=reff)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + scale_y_continuous(breaks = seq(0,1,by=0.25), limits = c(0,1.1))
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=zsreff)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + scale_y_continuous(breaks = seq(0,1,by=0.25), limits = c(0,1.1))
grid.arrange(p1, p2, nrow = 1)
Both classic and rank-normalized relative efficiency estimates have several near-zero values, and the overall sample shouldn’t be trusted.
Result: Relative efficiency is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,51:52], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.14
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.05
Relative efficiency for lp__ is worryingly low.
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is the empirical \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\). Each interval contains \(5\%\) of the draws, and the efficiency measures the autocorrelation of an indicator function for whether the values are inside the specific interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.
plot_local_reff <- function(samp = NULL, parami = NULL, nalpha = NULL) {
  zsreffs <- double()
  delta <- 1/nalpha
  for (alpha in seq(0, 1 - delta, by = delta)) {
    rs <- monitornew(samp[,,parami] >= quantile(samp[,,parami], alpha) &
                       samp[,,parami] < quantile(samp[,,parami], alpha + delta),
                     warmup = 0, probs = 0.5, digits_summary = 2, print = FALSE)
    zsreffs <- c(zsreffs, rs[1,'zsreff'])
  }
  df <- data.frame(quantile = seq(0, 1, by = delta),
                   zsreff = c(zsreffs, zsreffs[nalpha]))
  ggplot(data = df, aes(x = quantile, y = zsreff)) +
    geom_step() +
    geom_hline(yintercept = c(0, 1)) +
    geom_hline(yintercept = 0.1, linetype = 'dashed') +
    scale_x_continuous(breaks = seq(0, 1, by = 0.1)) +
    scale_y_continuous(breaks = seq(0, 1, by = 0.25), limits = c(0, 1.1))
}
plot_local_reff(samp = samp, parami = 1, nalpha = 20)
We see that MCMC is less efficient in tails.
Rank plot visualisation of x[1], which has relative efficiency 0.39
mcmc_hist(r_scale(samp[,,1]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Chains clearly have different distributions.
Rank plot visualisation of x[7], which has relative efficiency 0.02
mcmc_hist(r_scale(samp[,,7]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
There are clear problems with mixing.
max_treedepth=20
We can try to improve the performance by increasing the maximum treedepth.
fit_nom_td20 <- stan(file='cauchy_nom.stan', seed=7878, refresh = 0, control = list(max_treedepth=20))
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
MCMC trace for the first parameter looks wild with occasional large values
samp<-as.array(fit_nom_td20)
mcmc_trace(samp[,,1])
Treedepth exceedance and Bayesian Fraction of Missing Information are HMC/NUTS-specific diagnostics. Let’s see what the generic split-Rhat and relative efficiency can detect. Check the diagnostics for all \(x\)’s.
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic Split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.045)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank normalized split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.045)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Folded rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.045)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats are below 1.1, but many are over 1.01. Classic Rhat has more variation than the rank-normalized one (and the classic is not well defined). The folded rank-normalized Rhat shows that there is still more variation in scale than in location between chains.
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=reff)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + scale_y_continuous(breaks = seq(0,1.5,by=0.25), limits = c(0,1.5))
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=zsreff)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + scale_y_continuous(breaks = seq(0,1.5,by=0.25), limits = c(0,1.5))
grid.arrange(p1, p2, nrow = 1)
Some classic relative efficiencies (reffs) are close to zero. If we didn’t realize that the variance is infinite, we might try to run longer chains, but with infinite variance zero is the true value and longer chains don’t help. Other quantities can still be well defined, which is why it is also useful to look at the rank-normalized version as a generic transformation with finite mean and variance. The smallest rank-normalized r_eff values are around 0.25, which is not that bad. There is still large variation between parameters, although they all have the same distribution.
Result: Rank normalized relative efficiency is more stable than not-rank-normalized which is not well defined for Cauchy.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,51:52], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.13
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.06
Although increasing max treedepth improved diagnostics for x, the efficiency for I and lp__ didn’t change.
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
plot_local_reff(samp = samp, parami = 1, nalpha = 20)
Increasing max treedepth improved the efficiency slightly in the tails.
Rank plot visualisation of x[38], which has the smallest relative efficiency 0.23 among x’s.
mcmc_hist(r_scale(samp[,,38]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Clear problems.
Rank plot visualisation of lp__, which has relative efficiency 0.06.
mcmc_hist(r_scale(samp[,,52]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Doesn’t look so good.
max_treedepth=20 + 8 times longer sampling
Let’s try running longer chains.
fit_nom_td20l <- stan(file='cauchy_nom.stan', seed=7878, refresh = 0, control = list(max_treedepth=20), warmup=1000, iter = 9000)
Warning: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
MCMC trace for the first parameter looks wild with occasional large values
samp<-as.array(fit_nom_td20l)
mcmc_trace(samp[,,1])
Treedepth exceedance and Bayesian Fraction of Missing Information are HMC/NUTS-specific diagnostics. Let’s see what the generic split-Rhat and relative efficiency can detect. Check the diagnostics for all \(x\)’s.
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(.999, 1.009)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank norm. split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(.999, 1.009)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Fold. rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(.999, 1.009)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats are below 1.01. Classic Rhat has more variation than the rank normalized (but the classic is not well defined).
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=reff)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,1)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=zsreff)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,1)
grid.arrange(p1, p2, nrow = 1)
Most classic relative efficiencies (reffs) are close to zero. If we didn’t realize that the variance is infinite, we might try to run longer chains, but with infinite variance zero is the true value and longer chains don’t help. Other quantities can still be well defined, which is why it is also useful to look at the rank-normalized version as a generic transformation with finite mean and variance. The rank-normalized r_eff values start to concentrate around 0.6, which is quite good. There is still some variation between parameters.
Result: Rank normalized relative efficiency is more stable than not-rank-normalized which is not well defined for Cauchy.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,51:52], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','reff'],2), '\n')
I: r_eff = 0.16
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.04
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
plot_local_reff(samp = samp, parami = 1, nalpha = 20)
Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains, we can also use a finer resolution for the local efficiency estimates.
plot_local_reff(samp = samp, parami = 1, nalpha = 100)
The efficiency far in the tails is worryingly low.
Rank plot visualisation of x[34], which has the smallest relative efficiency 0.22 among x’s.
mcmc_hist(r_scale(samp[,,34]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Still problems at the edges.
Rank plot visualisation of lp__, which has relative efficiency 0.04.
mcmc_hist(r_scale(samp[,,52]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
There also seem to be slight problems in mixing for the energy.
Next we examine an alternative parameterization which presents the Cauchy as a scale mixture of Gaussian distributions. The model has two parameter vectors, and the Cauchy-distributed x’s can be computed from those. In addition to improved sampling performance, the example illustrates that the focus of the diagnostics matters.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}
transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}
model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}
generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model
fit_1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster.
samp<-as.array(fit_1)
res<-monitornew(samp[,,101:150], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank norm. split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Fold. rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats are below 1.01. The classic Rhats also look good, even though classic Rhat is not well defined for Cauchy-distributed quantities.
res<-monitornew(samp[,,101:150], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=reff)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + ylim(0,3.4)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=zsreff)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + ylim(0,3.4)
grid.arrange(p1, p2, nrow = 1)
Result: Rank-normalized r_eff values have less variation than the classic ones, which are not well defined for the Cauchy.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,151:152], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.75
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.33
The relative efficiencies for these are also much better than with the nominal parameterization.
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_1)
plot_local_reff(samp = samp, parami = 101, nalpha = 20)
The relative efficiency is good in all parts of the posterior.
res<-monitornew(samp[,,101:150], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
res1<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
res2<-monitornew(samp[,,51:100], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('Mean r_eff for xs = ' , round(mean(res[,'zsreff']),2), '\n')
Mean r_eff for xs = 1.03
cat('Mean r_eff for x_as = ' , round(mean(res1[,'zsreff']),2), '\n')
Mean r_eff for x_as = 1.15
cat('Mean r_eff for x_bs = ' , round(mean(res2[,'zsreff']),2), '\n')
Mean r_eff for x_bs = 0.69
Result: We see that the relative efficiency of the quantity of interest, x, can differ from the relative efficiency of the parameters used to compute it. Also, the r_eff we compute is for rank-normalized draws; to compute Monte Carlo errors, the final target function should be taken into account. But we want to avoid the complication of potentially non-existing moments at this point, and will come back in another paper to how the Pareto diagnostic can be used to check whether the necessary moments exist to compute the desired expectations and their Monte Carlo errors.
Rank plot visualisation of x[9], which has the smallest relative efficiency 0.85 among x’s.
mcmc_hist(r_scale(samp[,,109]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Looks better than for the nominal parameterization.
Rank plot visualisation of lp__, which has relative efficiency 0.33.
mcmc_hist(r_scale(samp[,,152]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Looks better than for the nominal parameterization. It would be useful to have the reference plotted.
Another alternative parameterization is a univariate transformation, as shown in the following code.
writeLines(readLines("cauchy_alt_3.stan"))
parameters {
  vector<lower=0, upper=1>[50] x_tilde;
}
transformed parameters {
  vector[50] x = tan(pi() * (x_tilde - 0.5));
}
model {
  // Implicit uniform prior on x_tilde
}
generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model (3rd alternative in Michael Betancourt’s case study)
fit_3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the nominal model.
samp<-as.array(fit_3)
res<-monitornew(samp[,,51:100], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.007)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank norm. split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.007)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Fold. rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1) + ylim(0.999, 1.007)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats except some folded Rhats are below 1.01. The classic Rhats also look good, even though classic Rhat is not well defined for Cauchy-distributed quantities.
res<-monitornew(samp[,,51:100], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`reff`)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,3.4)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsreff`)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,3.4)
grid.arrange(p1, p2, nrow = 1)
Result: Rank-normalized r_eff values have less variation than the classic ones, which are not well defined for the Cauchy. The rank-normalized r_eff values are slightly larger than 1, which is possible for antithetic Markov chains that have negative autocorrelation at odd lags.
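A small self-contained illustration of the antithetic effect, using an AR(1) chain with a negative coefficient (this mimics the autocorrelation pattern, not the dynamic HMC mechanism itself):

```r
# Negative lag-1 autocorrelation (an "antithetic" chain) makes
# 1 + 2 * rho_1 < 1, so the relative efficiency exceeds 1.
set.seed(6)
x <- as.vector(arima.sim(list(ar = -0.3), 10000))
rho1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
reff <- 1 / (1 + 2 * rho1)   # crude one-lag truncation
```

Here `reff` comes out well above 1: successive draws partially cancel each other’s errors, so the effective sample size exceeds the number of draws.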
We also check the indicator function I and lp__.
res<-monitornew(samp[,,101:102], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.43
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.36
The relative efficiencies for these are also much better than with the nominal parameterization.
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_3)
plot_local_reff(samp = samp, parami = 1, nalpha = 20)
The relative efficiency in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal model.
res<-monitornew(samp[,,51:100], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
res1<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('Mean r_eff for xs = ' , round(mean(res[,'zsreff']),2), '\n')
Mean r_eff for xs = 1.25
cat('Mean r_eff for x_as = ' , round(mean(res1[,'zsreff']),2), '\n')
Mean r_eff for x_as = 1.25
Result: When the transformation is univariate and bijective, the rank-normalized Rhat and r_eff are transformation invariant.
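The invariance follows directly from ranks being preserved under strictly increasing maps; a quick check with the tan() transformation used above:

```r
# Ranks, and hence rank-normalized diagnostics, are invariant under any
# strictly increasing transformation of the draws.
set.seed(7)
x_tilde <- runif(1000)
x <- tan(pi * (x_tilde - 0.5))   # the univariate Cauchy transformation
stopifnot(all(rank(x_tilde) == rank(x)))
```

Any diagnostic computed from the ranks alone therefore gives identical values for x_tilde and x.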
Naturally, when computing the desired expectations, the final target function should be taken into account; but we want to avoid the complication of potentially non-existing moments at this point, and will come back in another paper to how the Pareto diagnostic can be used to check whether the necessary moments exist to compute the desired expectations and their Monte Carlo errors.
Rank plot visualisation of x[13], which has the smallest relative efficiency 1.0 among x’s.
mcmc_hist(r_scale(samp[,,63]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Nothing special.
Rank plot visualisation of lp__, which has relative efficiency 0.36.
mcmc_hist(r_scale(samp[,,102]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Nothing special.
Half-Cauchy priors are common and, for example, in Stan are usually set using what is apparently the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically samples in the unconstrained log(x) space, which crucially changes the geometry.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}
model {
  x ~ cauchy(0, 1);
}
generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the half-Cauchy with nominal parameterization (and positive constraint)
fit_half_nom <- stan(file='half_cauchy_nom.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the Cauchy nominal model.
samp<-as.array(fit_half_nom)
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank norm. split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Fold. rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats are below 1.01. Classic Rhats also look good, even though classic Rhat is not well defined for the half-Cauchy distribution.
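For reference, a minimal sketch of how the folded variant is computed, assuming folding as absolute deviation from the median (the constants and details in monitornew.R may differ).

```r
# fzsRhat sketch: fold the draws (absolute deviation from the median),
# rank-normalize, then compute the usual split-Rhat.
split_rhat <- function(sims) {            # sims: iterations x chains
  n <- floor(nrow(sims) / 2)
  s <- cbind(sims[1:n, ], sims[(n + 1):(2 * n), ])  # split each chain in half
  m <- colMeans(s); v <- apply(s, 2, var)
  B <- n * var(m); W <- mean(v)           # between- and within-chain variances
  sqrt(((n - 1) / n * W + B / n) / W)
}
z_scale <- function(x)                    # pooled ranks to normal scores
  array(qnorm((rank(x) - 3/8) / (length(x) + 1/4)), dim(x))
fold <- function(x) abs(x - median(x))    # makes scale differences visible
set.seed(2)
sims <- array(rnorm(4000), c(1000, 4))    # four well-mixed chains
split_rhat(z_scale(fold(sims)))           # close to 1
```

Folding turns a difference in chain scales into a difference in chain locations, which split-Rhat is sensitive to.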
res<-monitornew(samp[,,1:50], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`reff`)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,2.2)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsreff`)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,2.2)
grid.arrange(p1, p2, nrow = 1)
Result: Rank-normalized r_eff’s have less variation than the classic one, which is not well defined for the half-Cauchy. Rank-normalized r_eff’s are much larger than 1, which is possible for antithetic Markov chains that have negative correlation at odd lags.
Due to the constraint <lower=0>, Stan samples in the log(x) space, and we can also check the performance in that space.
res<-monitornew(log(samp[,,1:50]), warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`reff`)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,2.2)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsreff`)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,2.2)
grid.arrange(p1, p2, nrow = 1)
\(log(x)\) is quite close to Gaussian, and thus the classic r_eff is also close to the rank-normalized r_eff, which is exactly the same as for \(x\), as the rank-normalized version is invariant to bijective transformations.
Result: Rank normalized r_eff is close to classic r_eff for transformations which make the distribution close to Gaussian.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,51:52], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 1.33
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.27
Result: Relative efficiency for \(p(x_1<1)\) is larger than 1. This is possible for antithetic Markov chains, which have negative correlation for odd lags.
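A minimal sketch of how antithetic autocorrelation yields relative efficiency above 1, using an AR(1) process with negative coefficient as a stand-in for an antithetic chain (the reff estimator here is a crude truncated-acf version, not the one in monitornew.R).

```r
# For AR(1) with coefficient phi, reff = (1 - phi) / (1 + phi),
# i.e. 3 for phi = -0.5: negative odd-lag correlations cancel variance.
set.seed(3)
phi <- -0.5
x <- as.vector(arima.sim(list(ar = phi), n = 1e5))
rho <- acf(x, lag.max = 50, plot = FALSE)$acf[-1]   # autocorrelations, lags 1..50
reff <- 1 / (1 + 2 * sum(rho))
reff   # close to 3, i.e. clearly above 1
```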
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
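A minimal sketch of the idea behind plot_local_reff, using independent draws as a stand-in for MCMC and a crude acf-based efficiency estimate (the actual estimator in monitornew.R differs).

```r
# For each quantile interval, compute the relative efficiency of the
# indicator I(Q_a <= x < Q_{a+0.05}).
reff_acf <- function(x, lag.max = 20) {
  rho <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]
  1 / (1 + 2 * sum(rho))
}
set.seed(4)
x <- rnorm(1e4)                          # independent draws as a stand-in for MCMC
Q <- quantile(x, seq(0, 1, by = 0.05))   # interval endpoints Q_0, Q_0.05, ..., Q_1
local_reff <- sapply(1:20, function(i)
  reff_acf(as.numeric(x >= Q[i] & x < Q[i + 1])))
round(local_reff, 2)                     # all close to 1 for independent draws
```

For a real sampler, dips in this curve show in which part of the posterior the chain moves slowly.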
samp <- as.array(fit_half_nom)
plot_local_reff(samp = samp, parami = 1, nalpha = 20)
The relative efficiency is good overall, with only a small dip in tails.
Rank plot visualisation of x[32], which has the smallest relative efficiency 1.0 among x’s.
mcmc_hist(r_scale(samp[,,32]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Looks like there are differences in scale of chains.
Rank plot visualisation of lp__, which has relative efficiency 0.27.
mcmc_hist(r_scale(samp[,,52]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Maybe small differences in scales, but it’s difficult to know whether small variation from uniform is relevant.
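Judging such deviations from uniformity needs a reference. A minimal sketch using a chi-squared test of uniformity on the binned pooled ranks, with pooled ranking assumed as a stand-in for what r_scale computes:

```r
# Chi-squared test of uniformity of the binned ranks of one chain.
rank_uniformity_test <- function(sims, chain, bins = 20) {
  r <- array(rank(sims), dim(sims))           # pooled ranks, reshaped per chain
  h <- hist(r[, chain], plot = FALSE,
            breaks = seq(0, length(sims), length.out = bins + 1))
  chisq.test(h$counts)$p.value                # counts uniform under good mixing
}
set.seed(5)
sims <- array(rnorm(4000), c(1000, 4))        # independent stand-in for 4 chains
p_mixed <- rank_uniformity_test(sims, chain = 1)
sims_bad <- sims
sims_bad[, 1] <- sims_bad[, 1] + 5            # one chain stuck in the wrong region
p_stuck <- rank_uniformity_test(sims_bad, chain = 1)
c(mixed = p_mixed, stuck = p_stuck)           # stuck chain gives a tiny p-value
```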
writeLines(readLines("half_cauchy_alt.stan"))
parameters {
vector<lower=0>[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a .* sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ inv_gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run half-Cauchy with alternative parameterization
fit_half_reparam <- stan(file='half_cauchy_alt.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is as fast as for the half-Cauchy nominal model.
samp<-as.array(fit_half_reparam)
res<-monitornew(samp[,,101:150], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`sRhat`)) + geom_point() + ggtitle('Classic split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsRhat`)) + geom_point() + ggtitle('Rank norm. split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
p3 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`fzsRhat`)) + geom_point() + ggtitle('Fold. rank split-Rhat') + geom_hline(yintercept = 1.005, linetype = 'dashed') + geom_hline(yintercept = 1)
grid.arrange(p1, p2, p3, nrow = 1)
All Rhats are below 1.01. Classic Rhats also look good, even though classic Rhat is not well defined for the half-Cauchy distribution.
res<-monitornew(samp[,,101:150], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
p1 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`reff`)) + geom_point() + ggtitle('Classic relative efficiency (reff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,1)
p2 <- ggplot(data=as.data.frame(res), aes(x=1:50, y=`zsreff`)) + geom_point() + ggtitle('New relative efficiency (zsreff)') + geom_hline(yintercept = c(0,1)) + geom_hline(yintercept = 0.1, linetype = 'dashed') + ylim(0,1)
grid.arrange(p1, p2, nrow = 1)
Result: Rank-normalized r_eff’s have less variation than the classic one, which is not well defined for the half-Cauchy. Based on the rank-normalized r_eff, the alternative parameterization has much lower efficiency than the nominal parameterization with the constraint <lower=0>.
We also check the indicator function I and lp__.
res<-monitornew(samp[,,151:152], warmup=0, digits_summary = 2, probs=0.5, print = FALSE);
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.78
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.24
We can also examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_half_reparam)
plot_local_reff(samp = samp, parami = 101, nalpha = 20)
The relative efficiency near zero is much worse than for the half-Cauchy with nominal parameterization and constraint <lower=0>.
Rank plot visualisation of x[15], which has the smallest relative efficiency 0.30 among x’s.
mcmc_hist(r_scale(samp[,,115]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Deviates from uniform, which is expected given the lower relative efficiency, but judging this by eye would require a reference.
Rank plot visualisation of lp__, which has relative efficiency 0.24.
mcmc_hist(r_scale(samp[,,152]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Deviates from uniform, which is expected given the lower relative efficiency, but judging this by eye would require a reference.
The models are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
Run centered parameterization model with default MCMC options.
input_data <- read_rdump("eight_schools.data.R")
fit_cp <- stan(file='eight_schools_cp.stan', data=input_data,
iter=2000, chains=4, seed=483892929, refresh=0)
Warning: There were 50 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
We do observe divergent transitions, which is a sensitive diagnostic for this model, and we also observe low BFMI, but let’s assume we were investigating an MCMC algorithm without such diagnostics.
res<-monitornew(fit_cp, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1.01 1.01 1.00 0.16 0.16
tau 1.02 1.02 1.00 0.07 0.05
theta[1] 1.00 1.00 1.01 0.28 0.26
theta[2] 1.00 1.00 1.01 0.27 0.26
theta[3] 1.01 1.01 1.01 0.21 0.21
theta[4] 1.00 1.00 1.01 0.29 0.28
theta[5] 1.01 1.01 1.01 0.22 0.22
theta[6] 1.00 1.00 1.01 0.28 0.27
theta[7] 1.00 1.00 1.01 0.23 0.22
theta[8] 1.00 1.00 1.01 0.31 0.28
lp__ 1.03 1.03 1.01 0.04 0.05
Some rank-normalized split-Rhats are larger than 1.01. Relative efficiencies for tau and lp__ are less than 10%, which is worrisome, and longer chains should be run.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_cp)
plot_local_reff(samp = samp, parami = 2, nalpha = 20)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we are likely missing a substantial amount of posterior mass and getting biased estimates.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.05 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Clear problems.
Rank plot visualisation of lp__, which has relative efficiency 0.05.
mcmc_hist(r_scale(samp[,,11]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Clear problems.
Low r_eff can sometimes be compensated for with longer chains. Let’s check 10 times longer chains.
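A back-of-the-envelope sketch with hypothetical numbers: if reff stayed constant as chains grow (which, as the runs in this notebook show, need not hold when mixing is poor), the post-warmup iterations needed per chain for a target effective sample size would be:

```r
# n_eff = reff * (chains * post-warmup iterations), so solve for iterations;
# the numbers below are purely illustrative.
iters_needed <- function(neff_target, reff, chains = 4)
  ceiling(neff_target / (reff * chains))
iters_needed(400, reff = 0.05)   # reff 5%: 2000 post-warmup iterations per chain
```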
fit_cp2 <- stan(file='eight_schools_cp.stan', data=input_data,
iter=20000, chains=4, seed=483892929, refresh=0)
Warning: There were 1179 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res<-monitornew(fit_cp2, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1.00 1.00 1.00 0.07 0.07
tau 1.01 1.03 1.00 0.03 0.01
theta[1] 1.00 1.00 1.00 0.22 0.20
theta[2] 1.00 1.00 1.00 0.25 0.22
theta[3] 1.00 1.00 1.00 0.13 0.11
theta[4] 1.00 1.00 1.00 0.26 0.22
theta[5] 1.00 1.00 1.00 0.11 0.09
theta[6] 1.00 1.00 1.00 0.16 0.15
theta[7] 1.00 1.00 1.00 0.18 0.16
theta[8] 1.00 1.00 1.00 0.26 0.21
lp__ 1.02 1.03 1.01 0.01 0.01
Some rank-normalized split-Rhats are still larger than 1.01. Relative efficiencies for tau and lp__ are around 1%. The drop in relative efficiency when the sample size was increased indicates serious problems in mixing.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_cp2)
plot_local_reff(samp = samp, parami = 2, nalpha = 50)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we are likely missing a substantial amount of posterior mass and getting biased estimates.
We can examine also other parameters, like mu in the following plot.
samp <- as.array(fit_cp2)
plot_local_reff(samp = samp, parami = 1, nalpha = 40)
There are gaps of poor efficiency, which indicates sticking of the MCMC, but the sticking doesn’t occur at any specific range of values of mu.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.01 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Clear problems sampling small values of tau.
Rank plot visualisation of lp__, which has relative efficiency 0.01.
mcmc_hist(r_scale(samp[,,11]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Now clear problems sampling different energy levels.
And for further evidence let’s check 100 times longer chains than the default.
fit_cp3 <- stan(file='eight_schools_cp.stan', data=input_data,
iter=200000, chains=4, seed=483892929, refresh=0)
Warning: There were 16912 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res<-monitornew(fit_cp3, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1.00 1.00 1 0.04 0.03
tau 1.00 1.01 1 0.01 0.00
theta[1] 1.00 1.00 1 0.03 0.03
theta[2] 1.00 1.00 1 0.05 0.04
theta[3] 1.00 1.00 1 0.15 0.09
theta[4] 1.00 1.00 1 0.08 0.06
theta[5] 1.00 1.00 1 0.11 0.09
theta[6] 1.00 1.00 1 0.13 0.09
theta[7] 1.00 1.00 1 0.02 0.02
theta[8] 1.00 1.00 1 0.08 0.05
lp__ 1.01 1.01 1 0.00 0.00
Some rank-normalized split-Rhats are still larger than 1.01. Relative efficiencies for tau and lp__ are around 0%, and relative efficiencies for the other parameters are also getting smaller. The drop in relative efficiency when the sample size was increased indicates serious problems in mixing.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_cp3)
plot_local_reff(samp = samp, parami = 2, nalpha = 100)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we are likely missing a substantial amount of posterior mass and getting biased estimates.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.00 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Clear problems sampling small values of tau; even with 100000 draws per chain, the plots don’t get crowded the way traceplots would.
Rank plot visualisation of lp__, which has relative efficiency 0.00.
mcmc_hist(r_scale(samp[,,11]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Now clear problems sampling different energy levels.
For hierarchical models, non-centered parameterization often works better
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
Run non-centered parameterization model with default options.
input_data <- read_rdump("eight_schools.data.R")
fit_ncp <- stan(file='eight_schools_ncp.stan', data=input_data,
iter=2000, chains=4, seed=483892929, refresh=0)
Warning: There were 3 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
We still observe some divergent transitions with the default adapt_delta and no divergences with adapt_delta=0.95. Let’s analyze both samples.
res<-monitornew(fit_ncp, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1 1 1 1.06 1.07
tau 1 1 1 0.66 0.57
theta_tilde[1] 1 1 1 1.27 1.29
theta_tilde[2] 1 1 1 1.33 1.35
theta_tilde[3] 1 1 1 1.33 1.34
theta_tilde[4] 1 1 1 1.29 1.32
theta_tilde[5] 1 1 1 1.25 1.25
theta_tilde[6] 1 1 1 1.28 1.30
theta_tilde[7] 1 1 1 1.13 1.15
theta_tilde[8] 1 1 1 1.48 1.50
theta[1] 1 1 1 0.95 1.02
theta[2] 1 1 1 1.07 1.11
theta[3] 1 1 1 1.05 1.13
theta[4] 1 1 1 1.12 1.16
theta[5] 1 1 1 1.15 1.19
theta[6] 1 1 1 0.99 1.05
theta[7] 1 1 1 0.97 1.07
theta[8] 1 1 1 1.12 1.18
lp__ 1 1 1 0.36 0.38
All Rhats are close to 1, and relative efficiencies are good.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_ncp)
plot_local_reff(samp = samp, parami = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.57 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
I think these show problems sampling small values of tau.
Rank plot visualisation of lp__, which has relative efficiency 0.38.
mcmc_hist(r_scale(samp[,,19]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/50)+0.5)
Now clear problems sampling different energy levels.
Next we examine the same model but with higher adapt_delta=0.95.
fit_ncp2 <- stan(file='eight_schools_ncp.stan', data=input_data,
iter=2000, chains=4, control=list(adapt_delta=0.95), seed=483892929, refresh=0)
We get zero divergences with adapt_delta=0.95.
res<-monitornew(fit_ncp2, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1 1 1 1.15 1.17
tau 1 1 1 0.83 0.62
theta_tilde[1] 1 1 1 1.09 1.11
theta_tilde[2] 1 1 1 1.16 1.17
theta_tilde[3] 1 1 1 1.23 1.23
theta_tilde[4] 1 1 1 1.26 1.28
theta_tilde[5] 1 1 1 1.15 1.16
theta_tilde[6] 1 1 1 1.29 1.29
theta_tilde[7] 1 1 1 1.10 1.11
theta_tilde[8] 1 1 1 1.19 1.19
theta[1] 1 1 1 0.99 1.07
theta[2] 1 1 1 1.24 1.29
theta[3] 1 1 1 1.07 1.13
theta[4] 1 1 1 1.24 1.27
theta[5] 1 1 1 1.17 1.20
theta[6] 1 1 1 1.11 1.18
theta[7] 1 1 1 1.18 1.22
theta[8] 1 1 1 1.03 1.10
lp__ 1 1 1 0.44 0.44
All Rhats are close to 1, and relative efficiencies are good and slightly better than with the default adapt_delta.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_ncp2)
plot_local_reff(samp = samp, parami = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
Check also with a finer resolution
samp <- as.array(fit_ncp2)
plot_local_reff(samp = samp, parami = 2, nalpha = 50)
Also with a finer resolution the efficiency of MCMC is good for different values of tau.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.62 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Higher adapt_delta seems to have improved sampling of small values.
Rank plot visualisation of lp__, which has relative efficiency 0.44.
mcmc_hist(r_scale(samp[,,19]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
Now clear problems sampling different energy levels.
If in doubt, we can keep adapt_delta=0.95 and run longer chains.
input_data <- read_rdump("eight_schools.data.R")
fit_ncp3 <- stan(file='eight_schools_ncp.stan', data=input_data,
iter=20000, chains=4, control=list(adapt_delta=0.95), seed=483892929, refresh=0)
res<-monitornew(fit_ncp3, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
mu 1 1 1 1.14 1.15
tau 1 1 1 0.79 0.64
theta_tilde[1] 1 1 1 1.24 1.25
theta_tilde[2] 1 1 1 1.38 1.39
theta_tilde[3] 1 1 1 1.36 1.37
theta_tilde[4] 1 1 1 1.33 1.33
theta_tilde[5] 1 1 1 1.28 1.28
theta_tilde[6] 1 1 1 1.35 1.36
theta_tilde[7] 1 1 1 1.16 1.16
theta_tilde[8] 1 1 1 1.39 1.40
theta[1] 1 1 1 1.01 1.08
theta[2] 1 1 1 1.26 1.30
theta[3] 1 1 1 1.10 1.16
theta[4] 1 1 1 1.22 1.24
theta[5] 1 1 1 1.22 1.25
theta[6] 1 1 1 1.10 1.15
theta[7] 1 1 1 1.12 1.14
theta[8] 1 1 1 1.11 1.18
lp__ 1 1 1 0.41 0.41
All Rhats are close to 1, and relative efficiencies are good and slightly better than with the default adapt_delta. Relative efficiency is similar to that of the shorter chains, which means that running longer chains gives us a correspondingly higher effective sample size.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_ncp3)
plot_local_reff(samp = samp, parami = 2, nalpha = 100)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
Rank plot visualisation of tau, which has the smallest relative efficiency 0.64 among parameters.
mcmc_hist(r_scale(samp[,,2]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/200)+0.5)
Higher adapt_delta seems to have improved sampling of small values.
Rank plot visualisation of lp__, which has relative efficiency 0.41.
mcmc_hist(r_scale(samp[,,19]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/200)+0.5)
Now clear problems sampling different energy levels.
We have already seen that the relative efficiency of HMC/NUTS can be higher than with independent draws. The next example illustrates interesting relative efficiency phenomena due to properties of the HMC/NUTS algorithm.
writeLines(readLines("normal.stan"))
data {
int<lower=1> J;
}
parameters {
vector[J] x;
}
model {
x ~ normal(0, 1);
}
fit_n <- stan(file='normal.stan', data=data.frame(J=16),
iter=20000, chains=4, seed=483892929, refresh=0)
samp<-as.array(fit_n)
res<-monitornew(samp, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
x[1] 1 1 1 2.53 2.55
x[2] 1 1 1 2.67 2.68
x[3] 1 1 1 2.54 2.56
x[4] 1 1 1 2.53 2.54
x[5] 1 1 1 2.49 2.50
x[6] 1 1 1 2.51 2.53
x[7] 1 1 1 2.49 2.50
x[8] 1 1 1 2.64 2.65
x[9] 1 1 1 2.55 2.56
x[10] 1 1 1 2.62 2.63
x[11] 1 1 1 2.54 2.55
x[12] 1 1 1 2.50 2.52
x[13] 1 1 1 2.59 2.60
x[14] 1 1 1 2.49 2.51
x[15] 1 1 1 2.58 2.59
x[16] 1 1 1 2.68 2.69
lp__ 1 1 1 0.36 0.35
The relative efficiency for all \(x\) is larger than 2. The relative efficiency for lp__ is, however, only 0.35. If we look at all the Stan examples in this notebook, we see that the relative efficiency for lp__ is always below 0.5. lp__ correlates strongly with the total energy in HMC, and the total energy is sampled using a random walk proposal once per iteration; thus it’s likely that lp__ has some random walk behavior, leading to autocorrelation and relative efficiency below 1. On the other hand, the no-U-turn behavior and sampling from the trajectory make the Markov chain antithetic, with negative odd-lag correlation and relative efficiency higher than 1.
Let’s check the local relative efficiency
plot_local_reff(samp = samp, parami = 1, nalpha = 100)
The relative efficiency for the probability estimate of a small interval is close to 1, with a slight drop in the tails. This is a good result, but far from the relative efficiency for the mean estimate.
The total energy of HMC should affect how far into the tails a chain can go in one iteration. The far tails of the target have high energy, and only iterations with high total energy can reach there. This suggests that the random walk in total energy would cause a random walk in the variance of \(x\). Let’s check the second moment of \(x\).
samp<-as.array(fit_n)^2
res<-monitornew(samp, warmup=0, digits_summary = 2, probs=0.5, print = FALSE)
round(res[,c('sRhat','zsRhat','fzsRhat','reff','zsreff')],2)
sRhat zsRhat fzsRhat reff zsreff
x[1] 1 1 1 0.36 0.39
x[2] 1 1 1 0.37 0.41
x[3] 1 1 1 0.38 0.41
x[4] 1 1 1 0.37 0.40
x[5] 1 1 1 0.37 0.41
x[6] 1 1 1 0.37 0.42
x[7] 1 1 1 0.35 0.39
x[8] 1 1 1 0.36 0.40
x[9] 1 1 1 0.38 0.41
x[10] 1 1 1 0.36 0.40
x[11] 1 1 1 0.37 0.41
x[12] 1 1 1 0.36 0.40
x[13] 1 1 1 0.36 0.41
x[14] 1 1 1 0.37 0.41
x[15] 1 1 1 0.37 0.40
x[16] 1 1 1 0.36 0.41
lp__ 1 1 1 0.39 0.35
The mean of the relative efficiencies for \(x[j]^2\) is 0.4, which is quite close to the relative efficiency for lp__. This is not that surprising, as the potential energy in the normal model is proportional to \(\sum_{j=1}^J x_j^2\).
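A minimal sketch of this relation for independent draws (no MCMC needed): for the iid standard normal target, lp__ equals \(-0.5 \sum_j x_j^2\) up to an additive constant, so it is perfectly negatively correlated with the summed second moment.

```r
# For the iid standard normal target, the log density is
# lp = -0.5 * sum(x_j^2) + const, so cor(lp, sum(x_j^2)) = -1 exactly.
set.seed(6)
x <- matrix(rnorm(1000 * 16), ncol = 16)   # draws as a stand-in for samp
lp <- -0.5 * rowSums(x^2)                  # log density up to a constant
cor(lp, rowSums(x^2))                      # exactly -1
```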
Let’s check the local relative efficiency
plot_local_reff(samp = samp, parami = 1, nalpha = 100)
The relative efficiency is mostly a bit below 1, but for the right tail of \(x[1]^2\) the relative efficiency drops. This is likely because only some iterations have high enough total energy to obtain draws from the high-energy part of the tail.
We can see the correlation between lp__ and magnitude of \(x[1]\) in the following plot.
samp<-as.array(fit_n)
qplot(as.vector(samp[,,17]),abs(as.vector(samp[,,1]))) + xlab('lp__') + ylab('x[1]')
Low lp__ corresponds to high energy and more variation in \(x\), and high lp__ corresponds to low energy and small variation in \(x\).
Finally, \(\sum_{j=1}^J x_j^2\) is perfectly (negatively) correlated with lp__.
qplot(as.vector(samp[,,17]),as.vector(apply(samp[,,1:16]^2,c(1,2),sum))) + xlab('lp__') + ylab('sum(x^2)')
This shows that even if we get high relative efficiency estimates for the parameters, it is important to look at the relative efficiency of lp__, as it can indicate problems in sampling the tails, which would affect the accuracy of variance and tail quantile estimates.
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs
CXXFLAGS+=-std=c++11
CFLAGS+=-O3
devtools::session_info("rstan")
Session info -------------------------------------------------------------
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, linux-gnu
ui X11
language en_GB:en
collate en_US.UTF-8
tz Europe/Helsinki
date 2018-06-06
Packages -----------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.3)
BH 1.66.0-1 2018-02-13 cran (@1.66.0-)
cli 1.0.0 2017-11-05 CRAN (R 3.4.3)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.3)
compiler 3.4.4 2018-03-16 local
crayon 1.3.4 2017-09-16 CRAN (R 3.4.3)
dichromat 2.0-0 2013-01-24 CRAN (R 3.4.3)
digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.3)
glue 1.2.0 2017-10-29 CRAN (R 3.4.3)
graphics * 3.4.4 2018-03-16 local
grDevices * 3.4.4 2018-03-16 local
grid 3.4.4 2018-03-16 local
gridExtra * 2.3 2017-09-09 CRAN (R 3.4.3)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.3)
inline 0.3.14 2015-04-13 CRAN (R 3.4.3)
labeling 0.3 2014-08-23 CRAN (R 3.4.3)
lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.3)
loo 2.0.0.9000 2018-05-23 Github (stan-dev/loo@ae4714e)
magrittr 1.5 2014-11-22 CRAN (R 3.4.3)
MASS 7.3-50 2018-04-30 CRAN (R 3.4.4)
Matrix 1.2-13 2018-04-02 CRAN (R 3.4.4)
matrixStats 0.53.1 2018-02-11 cran (@0.53.1)
methods * 3.4.4 2018-03-16 local
munsell 0.4.3 2016-02-13 CRAN (R 3.4.3)
parallel 3.4.4 2018-03-16 local
pillar 1.2.1 2018-02-27 CRAN (R 3.4.4)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.3)
R6 2.2.2 2017-06-17 CRAN (R 3.4.3)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.3)
Rcpp 0.12.17 2018-05-18 cran (@0.12.17)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.4.3)
reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
rlang 0.2.0 2018-02-20 cran (@0.2.0)
rstan * 2.17.3 2018-04-08 local
scales 0.5.0 2017-08-24 CRAN (R 3.4.3)
StanHeaders * 2.17.2 2018-01-20 CRAN (R 3.4.3)
stats * 3.4.4 2018-03-16 local
stats4 3.4.4 2018-03-16 local
stringi 1.1.7 2018-03-12 cran (@1.1.7)
stringr 1.3.0 2018-02-19 cran (@1.3.0)
tibble 1.4.2 2018-01-22 CRAN (R 3.4.3)
tools 3.4.4 2018-03-16 local
utf8 1.1.3 2018-01-03 CRAN (R 3.4.3)
utils * 3.4.4 2018-03-16 local
viridisLite 0.3.0 2018-02-01 CRAN (R 3.4.3)